Received: by cheltenham.cs.arizona.edu; Fri, 25 Nov 1994 07:42:26 MST
Original-Via:
Pp-Warning: Illegal Via field on preceding line
From: ROBERT VAN DER ZWAN <RZWAN@dish.gla.ac.uk>
To: icon-group@cs.arizona.edu
Date: Fri, 25 Nov 1994 14:30:48 GMT
Subject: textual analysis/tilt project glasgow uni.
Priority: normal
X-Mailer: PMail v3.0 (R1)
Message-Id: <49B163D1DE4@dish.gla.ac.uk>
Errors-To: icon-group-errors@cs.arizona.edu
Tilt C History
Checklist textual analysis 9/11/94
1. Textual database preparation/running
* a. allowing import of extended-character set ASCII text
(including main European languages). Importing should be easy
to handle.
* b. recognition of simple mark up, both for the structure of a text
(chapters, pages etc.) and for elements of content. Mark up of
the structure of a text is essential for performing
searches in parts of the text.
- Mark up preferably SGML because of possibilities of
interchange.
- Also of importance: possibilities of (semi)-automatic markup.
- It should be possible to hide the mark-up.
2. Vocabulary overview (providing rough pointers to the nature and content
of a text).
* a. Type-token ratio
* b. Complete word list with frequency count, displayable both
in alphabetic order and in order of frequency, for all or a
predetermined part of the text. Also a selected word list (as
opposed to a complete one).
c. token-character ratio (which should give a rough average of
word lengths)
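
A minimal sketch of the measures in 2a-c, assuming that a token is simply
a run of letters and that case is ignored. The names tokenize and
vocabulary_overview are hypothetical, and a real package would need a
more careful definition of "word" for historical texts.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a token is a run of Unicode letters, lower-cased.
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def vocabulary_overview(text):
    tokens = tokenize(text)
    counts = Counter(tokens)  # type -> number of occurrences
    n_tokens = len(tokens)
    return {
        # 2a: distinct words (types) divided by running words (tokens)
        "type_token_ratio": len(counts) / n_tokens if n_tokens else 0.0,
        # 2c: characters per token, i.e. a rough average word length
        "mean_word_length": sum(map(len, tokens)) / n_tokens if n_tokens else 0.0,
        # 2b: the same word list in both display orders
        "by_frequency": sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])),
        "alphabetical": sorted(counts.items()),
    }
```

Restricting the overview to a predetermined part of the text would only
mean passing that part in as the text argument.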
3. Content retrieval facilities.
NB. As much as possible of a-e should be done in conjunction and should be
subject to 'filtering' (treating only limited parts of the text)
* a. word searches including use of wild cards and Boolean
operators.
* b. combined search for user-defined clusters of semantically
unrelated but near synonymous words (noble, aristocr*)
* c. search for word pairs (e.g. social contract) and proximate
associates (mandatories of the people, rights of man/woman)
* d. search for roots and lemmas (e.g. oligarchy, monarchy;
noble for ennoblement, nobility).
- this could be done by the use of wildcards, but preferably by
way of parsing (?)
* e. collocation (including a user-defined span) producing a
z-score, which indicates how unlikely it is that the words
occur together merely by chance.
f. machine-generated search strategy via thesaurus (preferably a
user-trainable thesaurus, to accommodate variable historical
usage), where potentially related words are offered from the thesaurus
for confirmation or rejection by the searcher.
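
The checklist does not say which z-score is meant in 3e. The sketch
below assumes one formulation common in collocation studies:
z = (O - E) / sqrt(E * (1 - p)), where O is the observed number of
co-occurrences inside the span, p is the collocate's frequency divided
by the number of tokens that are not the node word, and E is p times the
number of positions inspected. The function name collocation_z and its
parameters are hypothetical.

```python
import math


def collocation_z(tokens, node, collocate, span=4):
    """z-score for `collocate` within `span` tokens either side of `node`.

    Simplifications (assumptions): overlapping windows are counted
    separately, and node and collocate are expected to be different words.
    """
    n_total = len(tokens)
    f_node = tokens.count(node)
    f_coll = tokens.count(collocate)
    if f_node == 0 or f_coll == 0 or n_total == f_node:
        return None  # nothing to measure

    observed = 0
    positions = 0  # number of window slots actually inspected
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - span), min(n_total, i + span + 1)
        window = tokens[lo:i] + tokens[i + 1:hi]
        positions += len(window)
        observed += window.count(collocate)

    p = f_coll / (n_total - f_node)  # chance of the collocate in any one slot
    expected = p * positions
    variance = expected * (1 - p)
    if variance <= 0:
        return None
    return (observed - expected) / math.sqrt(variance)
```

A large positive value suggests the two words keep company more often
than chance would predict; what counts as "large" is left to the user.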
4. Additional quantitative/stylistic facilities (extending basics of 2 above and
currently achievable only through combination of various software)
* a. enhancement of word frequency list (2b) by means of
statistical options to calculate how much unique words, twice
occurring words, and so on up to high frequency words
contribute both to the total vocabulary and to the total word
length (?) (useful to assess the audience which an author
may consciously or unconsciously have wanted to address, and
to refine the potentially misleading type-token ratio).
* b. graphical display of frequencies of unique words and so
on.
c. direct quantification of word and sentence length (see 2c
above) (paragraph length is not meaningful for most historical
texts and therefore not necessary).
d. quantification of use of question marks, passive voice etc.
e. simple parsing to assist with 3d-f (allowing the user to exclude,
e.g., all function words, or to search for nouns only etc.)
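
The calculation asked for in 4a can be sketched as a frequency
spectrum. This assumes that "total word length (?)" means the total
number of running words (tokens) in the text, which is only one possible
reading. The function name frequency_spectrum is hypothetical.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Rows of (frequency, number_of_types, share_of_vocabulary, share_of_tokens).

    The row with frequency 1 describes the unique words, the row with
    frequency 2 the twice occurring words, and so on upwards.
    """
    counts = Counter(tokens)             # type -> frequency
    spectrum = Counter(counts.values())  # frequency -> number of types
    n_types, n_tokens = len(counts), len(tokens)
    return [
        (freq, n, n / n_types, freq * n / n_tokens)
        for freq, n in sorted(spectrum.items())
    ]
```

The same rows could feed the graphical display in 4b, for instance as a
bar per frequency class.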
5. Display functions:
* a. keyword(s) displayed in full text (highlighted), and in
concordance form (index, user-definable KWIC), giving location
reference by line or marked-up section (chapter, page etc.) or
both.
* b. 'topographical' distribution display, showing clustering of
keyword(s) over the entire text or user-specified sections of
that text.
c. free movement between displays without the need for new
retrieval.
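
A minimal sketch of the KWIC display in 5a, giving the location
reference by line number only. The function name kwic, the width
parameter and the bracket highlighting are assumptions for illustration;
reference by marked-up section would need the mark-up from 1b.

```python
import re


def kwic(text, keyword, width=30):
    """Return (line_number, context) pairs for each hit of `keyword`.

    `width` is the user-definable number of characters shown on each
    side of the keyword; context does not cross line boundaries.
    """
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in pattern.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            # Right-align the left context so the keywords line up in a column.
            rows.append((lineno, f"{left:>{width}} [{m.group(0)}] {right}"))
    return rows
```

Counting the line numbers returned per section would give a crude
version of the distribution display in 5b.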
6. User facilities:
* a. simple interface for 'naive' users:
- all functions available by menu and/or icon
- preferably Windows compatible
- step by step guidance through procedures
- no use of difficult terms, or good help function available.
* b. easy output of results (to printer, word processor,
database package or spreadsheet), preferably by using the cut and
paste option in Windows.
* c. reasonable speed of performance for complex retrievals
(e.g. collocations) and large bodies of text (2-5 Mb)